Sentiment Hackpad

Authors:

Daniela Huppenkothen, Phil Marshall, Madhura Killedar

We did some natural language processing by performing sentiment analysis on the 2016 AstroHackWeek Hackpad.


In [22]:
!pip install textblob


Requirement already satisfied (use --upgrade to upgrade): textblob in /Users/discworld/miniconda3/lib/python3.5/site-packages
Requirement already satisfied (use --upgrade to upgrade): nltk>=3.1 in /Users/discworld/miniconda3/lib/python3.5/site-packages (from textblob)

In [23]:
from __future__ import unicode_literals, print_function
import textblob
import pandas as pd
import numpy as np

Test data

As a quick test, we feed some text into the textblob sentiment analyzer.

Polarity can range from -1 to 1:

  • -1 reflects extreme negative associations
  • 1 reflects extreme positive associations
  • 0 is neutral language
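
To make the scale concrete, here is a minimal sketch that maps a polarity score to a coarse label. The function name `label_polarity` and its `tolerance` parameter are our own illustrative choices, not part of textblob.

```python
def label_polarity(polarity, tolerance=0.0):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    `tolerance` is a hypothetical dead zone around 0 that counts as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie between -1 and 1")
    if polarity > tolerance:
        return "positive"
    if polarity < -tolerance:
        return "negative"
    return "neutral"


# Hand-picked scores spanning the whole range
for score in (-1.0, -0.8, 0.0, 0.5, 1.0):
    print(score, label_polarity(score))
```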

In [24]:
textblob.TextBlob("Hello World I hate you").sentiment.polarity


Out[24]:
-0.8

In [25]:
textblob.TextBlob("Hello World I love you").sentiment.polarity


Out[25]:
0.5

Hackpad data

Read data

To do: Find a more automatic text-scraping method


In [26]:
#textfile = "../hackpadtext_Wed.txt"
textfile = "../hackpadtext_Thu.txt"
#textfile = "../hackpadtext_Thu_active.txt"

In [27]:
# one chunk of text per line; note that newer pandas versions may reject sep="\n"
rawdata = pd.read_csv(textfile, header=None, names=["text"], sep="\n", encoding="utf-8")
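
Since each line of the text file is treated as one chunk, the same list of chunks can also be built without a CSV parser. This is a minimal sketch using only the standard library. `read_chunks` is a hypothetical helper, and the sample text is made up. Dropping blank lines mirrors the `read_csv` default of skipping them.

```python
import io


def read_chunks(handle):
    """Return one stripped text chunk per non-blank line of an open text file."""
    return [line.strip() for line in handle if line.strip()]


# Stand-in for an opened hackpad text file: a tiny made-up sample
sample = io.StringIO("First hack idea\n\nLunch?\n")
chunks = read_chunks(sample)
print(chunks)
```

With the real file, opening it via `io.open(textfile, encoding="utf-8")` and passing the result to `pd.DataFrame({"text": chunks})` should give a table with the same `text` column.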

Analyse

Analyse and store polarity of each chunk


In [28]:
# placeholder column of zeros, one entry per chunk of text
rawdata["polarity"] = np.zeros(len(rawdata))

In [29]:
# analyse the sentiment of each chunk of text (hack idea or comment)
feelings = []
for i in rawdata.index:
    data = rawdata.loc[i, "text"]
    polarity = textblob.TextBlob(data).sentiment.polarity
    # store the score both in the table and in a plain list
    rawdata.loc[i, "polarity"] = polarity
    feelings.append(polarity)

How happy are we on average?


In [30]:
average_feels = sum(feelings)/len(feelings)
print(average_feels)


0.180728006749

In [31]:
if average_feels>0:
    print("Yay, we're happy! wooooooooooo!")
else:
    print("oh no not happy jan")


Yay, we're happy! wooooooooooo!

Who sounds sad?


In [32]:
# search for sad hacks
rawdata[rawdata["polarity"]<0]


Out[32]:
     text                                               polarity
0    Active Projects:                                   -0.133333
1    Move your project up here if it is being activ...  -0.166667
12   AstroHackWeek image Gallery - (Arna) Image gal...  -0.0375
21   Deprojecting Galaxies (or molecular structure)...  -0.0111111
26   Here's my ongoing failure in notebook form         -0.316667
35   Classifying the pulse shapes of pulsars using ...  -0.0218182
36   A custom Monte Carlo sampler for the Kepler pr...  -0.225952
40   Making MCMC fail on problems with implicit, fl...  -0.00625
50   Classifying the pulse shapes of pulsars using ...  -0.0218182
60   Modelling 2-D Impulse Response Function for Ac...  -0.129167
73   Managing Large Scale Structure Data with Datab...  -0.0111772
84   Neural Networks (Zaki Ali) - I'm working on a ...  -0.148864
92   Start with a single species (say FeII), conver...  -0.0107143
102  Bayesian networks for inference of young star ...  -0.11
105  Python API to perform SDSS SQL Queries: Sky Se...  -0.09375

In [33]:
# search for happy hacks
#rawdata[rawdata["polarity"]>0]

In [34]:
# Top Five Happy Hacks!
rawdata.sort_values("polarity")[::-1][:5]


Out[34]:
     text                                               polarity
7    Tips and Tricks for Teaching with Jupyter Note...  1
100  Lunch sounds good!                                 0.875
111  happy to chat about uncertainty and implementi...  0.8
98   A good point of reference: streams. Hope to jo...  0.7
62   Sure, sounds good!                                 0.6875

Wait... most of those sound like comments, not hacks!

Hackpad data (filtering out short comments)

Now, we'll assume and hope that a chunk of text with more than 20 words is an actual hack project idea as opposed to a comment. This isn't always true, so there's room for improvement.


In [35]:
# placeholder boolean column, one entry per chunk of text
rawdata["mask"] = np.zeros(len(rawdata), dtype=bool)

In [36]:
# select only chunks with more than 20 words
for i in rawdata.index:
    if len(rawdata.loc[i,"text"].split(" "))>20:
        rawdata.loc[i,"mask"] = True
    else:
        rawdata.loc[i,"mask"] = False
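
The word-count rule can also be written as a small standalone function. This is only a sketch: `is_probable_hack` and `min_words` are names we made up, and it uses `split()` with no argument, which (unlike `split(" ")`) does not count the empty strings produced by repeated spaces as words.

```python
def is_probable_hack(text, min_words=20):
    """Return True when `text` has more than `min_words` whitespace-separated words."""
    return len(text.split()) > min_words


short_comment = "Lunch sounds good!"
long_chunk = " ".join(["word"] * 21)  # 21 words, just over the threshold

print(is_probable_hack(short_comment))
print(is_probable_hack(long_chunk))
```

Applied to the table, `rawdata["text"].apply(is_probable_hack)` would give a boolean mask of the same kind in one step.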

New dataset only includes hacks, not comments


In [37]:
hackdata = rawdata[rawdata["mask"]]

In [38]:
# Top Five Sad Actually-Hacks (probably)
hackdata.sort_values("polarity")[:5]


Out[38]:
     text                                               polarity   mask
36   A custom Monte Carlo sampler for the Kepler pr...  -0.225952  True
84   Neural Networks (Zaki Ali) - I'm working on a ...  -0.148864  True
60   Modelling 2-D Impulse Response Function for Ac...  -0.129167  True
102  Bayesian networks for inference of young star ...  -0.11      True
105  Python API to perform SDSS SQL Queries: Sky Se...  -0.09375   True

In [39]:
# Top Five Happy Actually-Hacks (probably)
hackdata.sort_values("polarity")[::-1][:5]


Out[39]:
     text                                               polarity   mask
7    Tips and Tricks for Teaching with Jupyter Note...  1          True
6    Gaussian Process Tutorial (Jake/Phil) We start...  0.625      True
95   Long-shot: if we finish the automatic velocity...  0.5        True
39   Create color palettes for custom queries (Adri...  0.5        True
30   collaboratr (Mike Baumer, Usman Khan, Casey L...   0.5        True

Repeat analysis from earlier


In [40]:
moarfeelings = []
for i in hackdata.index:
    data = hackdata.loc[i, "text"]
    polarity = textblob.TextBlob(data).sentiment.polarity
    moarfeelings.append(polarity)

In [41]:
average_feels = sum(moarfeelings)/len(moarfeelings)
print(average_feels)


0.168875721776

In [42]:
if average_feels>0:
    print("YAY, WE'RE ACTUALLY HAPPY! wooooooooooo!")
else:
    print("oh no we're actually sad")


YAY, WE'RE ACTUALLY HAPPY! wooooooooooo!
